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Abstract 

We consider the problem of estimating Shannon's entropy H from discrete data, in cases where 
the number of possible symbols is unknown or even countably infinite. The Pitman- Yor process, a 
generalization of Dirichlet process, provides a tractable prior distribution over the space of countably 
infinite discrete distributions, and has found major applications in Bayesian non-parametric statistics 
and machine learning. Here we show that it also provides a natural family of priors for Bayesian 
entropy estimation, due to the fact that moments of the induced posterior distribution over H 
can be computed analytically. We derive formulas for the posterior mean (Bayes' least squares 
estimate) and variance under Dirichlet and Pitman- Yor process priors. Moreover, we show that a 
fixed Dirichlet or Pitman- Yor process prior implies a narrow prior distribution over H, meaning the 
prior strongly determines the entropy estimate in the under-sampled regime. We derive a family of 
continuous mixing measures such that the resulting mixture of Pitman- Yor processes produces an 
approximately flat prior over H. We show that the resulting Pitman- Yor Mixture (PYM) entropy 
estimator is consistent for a large class of distributions. We explore the theoretical properties of 
the resulting estimator, and show that it performs well both in simulation and in application to 
real data. 

Keywords: entropy, information theory, Bayesian estimation, Bayesian nonparametrics, Dirichlet 
process, Pitman- Yor process, neural coding 



1. Introduction 



Shannon's discrete entropy appears as a basic statistic in many fields, from probability theory to 
engineering and even ecology. While entropy may best be known as a theoretical quantity, its accurate 
estimation from data is an important step in disparate applications. Entropy is employed in the 



study of information processing in neuroscience ( Barbieri et al. 2004 Rolls et al. 1999 Shlens et al. 



2007 Strong et al. 19981. It is also used in statistics and machine learning for estimating dependency 



structure and inferring causal relations ( Chow and Liu 1968 Schindler et al. 2007), for example in 



molecular biology (Hausser and Strimmer 2009); as a tool in the study of complexity and dynamics 



in physics (Letellier, 2006); and as a measure of diversity in ecology (Chao and Shen, 2003) and 



genetics (Farach et al. 19951. Each of these studies, confronted with data arising from an unknown 



*. EA and IP contributed equally. 
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Figure 1: Graphical model illustrating the ingredients for Bayesian entropy estimation. Arrows 
indicate conditional dependencies between variables, and the gray "plate" denotes multiple 
copies of a random variable (with the number of copies N indicated at bottom). For entropy 
estimation, the joint probability distribution over entropy H, data x = {xj}, discrete distri- 
bution 7r = {71-;}, and parameter 8 factorizes as: p(H,x,ir,9) = p(H\tt)p(x.\tv)p(it\6)p(6). 
Entropy is a deterministic function of 7r, so p(H\ir) = S(H — 7Tj log7Tj). The Bayes least 
squares estimator corresponds to the posterior mean: E[i?|x] = jj p(H\tt)p(tt, 9\x)dn d8. 



discrete distribution, seeks to estimate the entropy rather than the distribution itself. Estimating the 
entropy is much easier than estimating the full distribution. In fact, in many cases, entropy can be 
accurately estimated with fewer samples than the number of distinct symbols. However, entropy 
estimation remains a difficult problem: there is no unbiased estimator for entropy, and the maximum 
likelihood estimator is severely biased for small datasets. Many previous studies have focused upon 



methods for computing and reducing this bias (Grassberger 2008 Miller 1955 Paninski 2003 



Panzeri and Treves. 1996 Strong et al. 1998). In this paper we instead take a Bayesian approach, 
building upon the work of Nemenman et al. ( 2002 ) . Our basic strategy is to place a prior over the 



space of discrete probability distributions, and then perform inference using the induced posterior 
distribution over entropy. (See Fig. [IJ. 

We focus here on the under-sampled regime, where the number of unique symbols observed in 
the data is small in comparison with the unknown (perhaps countably infinite) number of possible 
symbols. The Pitman- Yor process (PYP), a two-parameter generalization of the Dirichlet process 



(DP) (Goldwater et al. 2006 Ishwaran and James| 2003 Pitman and Yor 1997), provides an 



attractive family of priors in this setting, since: (1) the posterior distribution over entropy has 
analytically tractable moments; and (2) distributions drawn from a PYP can exhibit power-law tails, 



a feature commonly observed in data from social, biological, and physical systems (Dudok de Wit 



1999 


Newman 


2005 


Zipf. 


1949) 



Zipf, 1949). We show that a PYP prior with fixed hyperparameters imposes a 



narrow prior distribution over entropy, leading to severe bias and overly narrow posterior credible 
intervals given a small dataset. Our approach, inspired by Nemenman et al. (20021, is to introduce a 
family of mixing measures over Pitman- Yor processes such that the resulting Pitman- Yor Mixture 
(PYM) prior provides an approximately non-informative (i.e., fiat) prior over entropy. The remainder 
of the paper is organized as follows. In Section [2j we introduce the entropy estimation problem and 
review prior work. In Section 3, we introduce the Dirichlet and Pitman- Yor processes and discuss key 
mathematical properties relating to entropy. In Section 4, we introduce a novel entropy estimator 
based on PYM priors and derive several of its theoretical properties. In Section 5, we show compare 
various estimators with applications to data. 
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2. Entropy Estimation 

Consider samples x := {xj}^ =1 drawn iid from an unknown discrete distribution tv := {^j^Lx on 
a finite or (countably) infinite alphabet X, with cardinality A, that is, p(Xj = i) = 7T;. We wish to 
estimate the entropy of tt, 



H(tt) 



A 

£ 

i=l 



TTi log TTi 



(1) 



We are interested in the under-sampled regime, N <C A, where many of the symbols remain 
unobserved. We will see that a naive approach to entropy estimation in this regime results in seriously 
biased estimators, and briefly review approaches for correcting this bias. We then consider Bayesian 
techniques for entropy estimation in general before introducing the NSB method upon which the 
remainder of the article will build. 



2.1 Plugin estimator and bias-correction methods 

Perhaps the most straightforward entropy estimation technique is to estimate the distribution 7T and 
then use the plugin formula ([I]) to evaluate its entropy. The empirical distribution n = (iri, . . . ,tta) 
is computed by normalizing the observed counts n := (ni, . . . , riji) of each symbol, 



= n k /N, 



n k = 



{xi — k} > 



(2) 



for each k € X. Plugging this estimate for 7r into ([!]), we obtain the so-called "plugin" estimator: 

H plugin = - V" 7Ti log 7Ti, (3) 



which is also the maximum-likelihood estimator under discrete (or multinomial) likelihood. Despite 
its simplicity and desirable asymptotic properties, iJ p i ug in exhibits substantial negative bias in the 
undersampled regime. There exists a large literature on methods for removing this bias, much of 
which considers the setting in which A is known and finite. One popular and well-studied method 



involves taking a series expansion of the bias ( Grassberger 2008 Miller 1955 Panzeri and Treves 



1996 Treves and Panzeri, 1995) and then subtracting it from the plugin estimate. Other recent 
proposals include minimizing an upper bound over a class of linear estimators (Paninski 2003), and 



a James-Stein estimator ( Hausser and Strimmer 2009[ ) . Recent work has also considered countably 



infinite alphabets. The coverage-adjusted estimator (CAE) (Chao and Shen, 2003 Vu et al. 2007) 



addresses bias by combining the Horvitz-Thompson estimator with a nonparametric estimate of the 
proportion of total probability mass (the "coverage" ) accounted for by the observed data x. In a 
similar spirit, Zhang (2012) proposed an estimator based on the Good- Turing estimate of population 
size. 



2.2 Bayesian entropy estimation 

The Bayesian approach to entropy estimation involves formulating a prior over distributions w, and 
then turning the crank of Bayesian inference to infer H using the posterior distribution. Bayes' least 
squares (BLS) estimators take the form: 



ff(x) = E[H |x] = J H(n)p(H\Tr)p(ir\x)div, (4) 
where p(tt|x) is the posterior over 7r under some prior p(n) and discrete likelihood p(x|7r), and 

p(H\n) - S(H + J2^ log n), (5) 
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since H is deterministically related to 7r. To the extent that p(tv) expresses our true prior uncertainty 
over the unknown distribution that generated the data, this estimate is optimal (in a least-squares 
sense), and the corresponding credible intervals capture our uncertainty about H given the data. For 
distributions with known finite alphabet size A, the Dirichlet distribution provides an obvious choice 
of prior due to its conjugacy with the categorical distribution. It takes the form 



(6) 



for 7r on the ^.-dimensional simplex (jri > 1, 7Ti = 1), where a > is a "concentration" parameter 
(Hutter 2002). Many previously proposed estimators can be viewed as Bayesian under a Dirichlet 



prior with particular fixed choice of a. See Hausser and Strimmer (2009) for a historical overview of 



entropy estimators arising from specific choices of a. 



2.3 Nemenman-Shafee-Bialek (NSB) estimator 

In a seminal paper, Nemenman et aL"| (2002) showed that for finite distributions with known A, 
Dirichlet priors with fixed a impose a narrow prior distribution over entropy. In the undersampled 
regime, Bayesian estimators based on such priors are severely biased. Moreover, they have undesirably 
narrow posterior credible intervals, reflecting narrow prior uncertainty rather than strong evidence 
from the data. (These estimators generally give incorrect answers with high confidence!). To address 
this problem, Nemenman et al. (2002) suggested a mixture-of-Dirichlets prior: 



p(ir) 



PDir(7r|a)p(a)(ia, 



(7) 



where pD ll: (Tv\a) denotes a Dir(a) prior on ir, and p(a) denotes a set of mixing weights, given by 



p(a) oc —E[H\a] = AtpAAa + 1) - Ma + 1), 
da 



(8) 



where E[/f|a] denotes the expected value of H under a Dir(a) prior, and tpi{-) denotes the tri-gamma 
function. To the extent that p(H\a) resembles a delta function, ^ and ^ imply a uniform prior 
for H on [0, log A]. The BLS estimator under the NSB prior can be written: 



H ns b = E[ff|x] = / / H(ir)p(ir\x, a) p(a|x) d7r da 



„. , , p(x.\a)v(a) , 
E[H x, a P '/V da, 
p(x) 



(9) 



where E[H\x., a] is the posterior mean under a Dir(a) prior, and p(x\a) denotes the evidence, which 
has a Polya distribution ( |Minka[ |2003[ ) : 



p(x|a) = / p(x|7r)p(7r|o;)(i7r 
(N\)T(Aa) 



T(a) A T(N + Aa) 11 n 

i — 1 



n 



r(rij + a) 



(10) 



The NSB estimate H ns b and its posterior variance are fast to compute via ID numerical integration 
in a using closed-form expressions for the first two moments of the posterior distribution of H given 



a. The forms for these moments are discussed in Nemenman et al. (2002); Wolpert and Wolf (1995) 
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but the full formulae are not explicitly shown. Here we state the results: 



E[#|x, a] = MN + 1) - jMni + 1) 



ELfPlx, al 



n % n k 



E 



(fii + l)n 



- / 



(11) 
(12) 



(N + 1)N 

hk = (M^k + 1) - MN + 2)) (V> (n; + 1) - + 2)) - ^(JV + 2) 
Ji = [Mni + 2) - Vo(W + 2)) 2 + ^(f* + 2) - M N + 2), 

where hi — rii + a are counts plus prior "pseudocount" a, N = ^fii is the total of counts plus 
pseudocounts, and ip n is the polygamma of n-th order (i.e., ipo is the digamma function). Finally, 
var[i/|n, a] = E[H 2 \n, a] — E[iJ|n, a] 2 . We derive these formulae in the Appendix, and in addition 
provide an alternative derivation using a size-biased sampling formulae discussed in Section [3] 

2.4 Asymptotic NSB estimator 

Nemenman et al. have proposed an extension of the NSB estimator to countably infinite distributions 
(or distributions with unknown cardi nality), using a zeroth order approximation to H ns b in the limit 



A 



which we refer to as H, 



nsb^ 



( Nemenman 



2011 



Nemenman et al. 



2004), 



H nsboa = 2 log(A0 + MN - K) - Vo(l) - log(2), (13) 
where K is the number of distinct symbols in the sample. Unfortunately, Hnsb^ increases unboundedly 



with N (as noted by Vu et al. (2007[)), and performs poorly for the examples we consider. 



3. Dirichlet and Pitman- Yor Process Priors 

To construct a prior over unknown or countably infinite discrete distributions, we borrow tools from 
nonparametric Bayesian statistics. The Dirichlet Process (DP) and Pitman- Yor process (PYP) define 



stochastic processes w hose samples are countably infinite discrete distributions (Kingman, 1975 
Pitman and Yor |1997). A sample from a DP or PYP may be written as YliLi % i^4>a where tt = {iTi} 



denotes a countably infinite set of 'weights' on a set of atoms {</>;} drawn from some base probability 
measure, where S ( f >i denotes a delta function on the atom 0i[^|We use DP and PYP to define a prior 
distribution on the infinite-dimensional simplex. The prior distribution over tt under the DP or PYP 
is technically called the GEM distribution or the two-parameter Poisson-Dirichlet distribution, but 
we will abuse terminology by referring to both the process and its associated weight distribution by 



the same symbol, DP or PY (Ishwaran and Zarepour 2002). The DP distribution over iz results from 



a limit of the (finite) Dirichlet distribution where alphabet size grows and concentration parameter 
shrinks: A — > oo and a — > s.t. aA — > a' . The PYP distribution over 7r generalizes the DP to 
allow power-law tails (and includes DP as a special case) (Kingman 1975| Pitman and Yor 1997[) 



For PY(d, a) with d ^ 0, the tails approximately follow a power-law: W{ oc (i) d (pp. 867, ( |Pitman 
and Yor 1997] )) pjMany natural phenomena such as city size, language, spike responses, etc., also 



exhibit power-law tails (Newman 2005 Zipf 1949). Fig. [2] shows two such examples, along with a 
sample drawn from the best-fitting DP and PYP distributions. Let PY(d, a) denote the PYP with 
discount parameter d and concentration parameter a (also called the "Dirichlet parameter"), for 
d 6 [0, 1), a > —d. When d — 0, this reduces to the Dirichlet process, DP(a). We can draw samples 



Here, we will assume the base measure is non-atomic, so that the atoms <f>i's are distinct with probability one. 
This allows us to ignore the base measure, making entropy of the distribution equal to the entropy of the weights 



2. Note that the power-law exponent is given incorrectly in (Gold water et al 



2006 



Teh 



2006) 
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7r PY(d, a) from an infinite sequence of independent Beta random variables in a process known as 



"stick-breaking" (Ishwaran and James 20011: 



i-l 



Pi ~ Beta(l - d,a + id), tt 4 = - /3j)/3j 

3=1 



(14) 



where 7Tj is known as the i'th size-biased permutation from tt (Pitman 1996). The ffj sampled in this 
manner are not strictly decreasing, but decrease on average such that 2j=i tt-j = 1 with probability 1 
(Pitman and Yorl 11997b. 



3.1 Expectations over DP and PYP priors 

A key virtue of PYP priors for our purposes is a mathematical property called invariance under 
size-biased sampling, which allows us to convert expectations over 7r on the infinite-dimensional 
simplex (which are required for computing the mean and variance of H given data) into one- or 
two-dimensional integrals with respect to the distribution of the first two size- biased samples ( Perman 
eFaH [19921 |Pitman| [l996| . 



Proposition 1 (Expectations with first two size-biased samples) For n ~ PY(d,a), 



E 



(7r|d,a) 



= E 



(#i|d,a) 



E 



(ir\d,a) 



= E 



(ifi ,#2 \d,a) 



9(^1,^2) 



7T17T2 



(l-7f X ) 



(15) 
(16) 



where tti and 112 are the first two size-biased samples from tv. 



The first result (15) appears in (Pitman and Yor 1997), and an analogous proof can be constructed 



for (16) (see Appendix). The direct consequence of this lemma is that the first two moments of H(n) 
under the PYP and DP priors have closed forms, 



E[H\d,a] = 
vax[H\d, a] = 



ip (a+l)-ip (l-d), 
a + d 1 — d 



(a + 1) 2 (1 -d) 
The derivation can be found in the appendix. 



1 



^(2-d)-V>i(2 + a). 



(17) 
(18) 



3.2 Expectations over DP and PYP posteriors 

A useful property of PYP priors (for multinomial observations) is that the posterior p(7r|x, d, a) 
takes the form of a mixture of a Dirichlet distribution (over the observed symbols) and a Pitman- Yor 
process (over the unobserved symbols) (Ishwaran and James , |2003[ ) . This makes the integrals over 
the infinite-dimensional simplex tractable, and as a result we obtain closed form solutions for the 
posterior mean and variance of H. Let K be the number of unique symbols observed in N samples, 
i.e., K = X)j=i l{rti>o}- Further, let ctj = nt — d, N = Yl n i> an d A = ^2 a i — 12i n i — = N — Kd. 



Now, following ( Ishwaran and Zarepour 2002 ) we write the posterior as an infinite random vector 
T^post = (Pi,P2,P3, ■ ■ ■ ,PK,P*ir), where 



(jPl,P2,---,PK,P*) 
7V := (7Tl , 7T 2 , 7T 3 , . . . ) 



Dir(ni — d, . . . , nx 
PY{d,a + Kd). 



d 7 a + Kd) 



(19) 
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wordcount (n) 



Figure 2: Frequency distributions of words in natural language (above) and neural spike patterns 
(below). We compare samples from the DP (red) and PYP (blue) priors for two datasets 
with heavy tails (black). In both cases, we compare the empirical CDF estimated from data 
to distributions drawn from DP and PYP using the ML values of a and (d, a), respectively. 
For both datasets, PYP better captures the heavy-tailed behavior of the data. Top: 
Frequency of N = 217826 words in the novel Moby Dick by Herman Melville. Bottom: 
Frequencies among N — 1.2 x 10 6 neural spike words from 27 simultaneously-recorded 
retinal ganglion cells, binarized and binned at 10ms. 



The posterior mean E[H\x,d,a] is given by, 

a + Kd 



E[H\a,d,x] = i>o(a + N + 1) 



a + N 



^o(l - d) 



1 



a + N 



K 



1) 



(20) 



The variance, var[_ff |x, d, a], also has an analytic closed form which is fast to compute. As we discuss 
in detail in Appendix A. 4 var[if |x, d, a] may be expressed in terms of the first two moments of p„, ir 7 
and p = (pi, . . . ,pk) appearing in the posterior (19). Applying the law of total variance and using 
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the independence properties of the posterior, we find: 

var[# \d, a] = E p , [(1 - p*) 2 ] var[ff(p)] + E p „ [pi] var[i? (tt)] 

p 7T 

+ E Pr [n 2 (p,)}-E p ,[n( P ,)} 2 1 (21) 

where 0(p*) := (1 -p*)E p [fl"(p)] + p*E w [if(-Tr)] + /i(p*). To specify fi(p*), we let A := E p [H(p)}, 
B := E„ [H(tt)} so that, 

E[0] := E p , [1 - p,]Ep [H(p)] + E p , [p,]E w [#(*■)] + hfa), 
E[fl 2 } := 2E Pr [p»ft(p»)][B - A] + 2AE p „ [hip*)} + E p , [h 2 {p,)\ 

+ E p . [pi] [B 2 - 2AB] + 2Ep, [p,]AB + E p , [(1 - p*) 2 ]A 2 . 

4. Entropy inference under DP and PYP priors 



The posterior expectations computed in Section 3.2 provide a class of entropy estimators for dis 



tributions with countably-infinite support. For each choice of (d, a), E[H\a, d, x] is the posterior 



mean under a PY{d, a) prior, analygous to the fixed-a Dirichlet priors discussed in Section 2.2 
Unfortunately, fixed PY(d, a) priors also carry the same difficulties as fixed Dirichlet priors. A 
fixed-parameter PY(d, a) prior on 7r results in a highly concentrated prior distribution on entropy 
(Fig. [3]). We address this problem by introducing a mixture prior p(d, a) on PY(d, a) under which the 
implied prior on entropy is flat [^] We then define the BLS entropy estimator under this mixture prior, 
the Pitman- Yor Mixture (PYM) estimator, and discuss some of its theoretical properties. Finally, 
we turn to the computation of PYM, discussing methods for sampling, and numerical quadrature 
integration. 

4.1 Pitman- Yor process mixture (PYM) prior 



One way of constructing a flat mixture prior is to follow the approach of 



Nemenman et al. (2002) by 



setting p(d, a) proportional to the derivative of the expected entropy ( 17). Unlike NSB, we have two 
parameters through which to control the prior expected entropy. For instance, large prior (expected) 
entropies can arise either from large values of a (as in the DP) or from values of d near 1 (see Fig.[3]\). 
We can explicitly control this trade-off by reparametrizing PYP as follows, 

k = M a + D-Ml-d), ,= ,^^7^ ' (22) 

ip Q {a + 1) - Vo(l - d) 



where h > is equal to the expected prior entropy ( 17), and 7 G [0, 00) captures prior beliefs about 
tail behavior (Fig. |4j\). For 7 = 0, we have the DP (i.e., d = 0, giving tt with exponential tails), while 
for 7 = 1 we have a PY(d, 0) process (i.e., a = 0, yielding 7r with power-law tails). Where required, 
the inverse transformation to standard PY parameters is given by: a — 4>o~ 1 (Ml ~ l) + ^o(l)) ~ 1> 
d = 1 — V-"o 1 C0o(l) ~ ^7) 7 where "0o _1 (-) denotes the inverse digamma function. We can construct 
an (approximately) flat improper distribution over H on [0, 00] by setting p(h, 7) = 9(7) for all h, 
where q is any density on [0, 00). We call this the Pitman- Yor process mixture (PYM) prior. The 
induced prior on entropy is thus: 

p(H) = J J p{H |ff)p(7r| 7) fc)p(7, h) d 7 dh, (23) 

3. Notice, however, that by constructing a flat prior on entropy, we do not obtain an objective prior. Here, we 
are not interested in estimating the underlying high-dimensional probabilities {iri}, but rather in estimating a 
single statistic. An objective prior on the model parameters is not necessarily optimal for estimating entropy: 
entropy is not a parameter in our model. In fact, Jeflerys' prior for multinomial observations is exactly a Dirichlet 
distribution with a fixed a = 1/2. As mentioned in the text, such Bayesian priors heavily bias entropy estimates. 
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Prior Mean 



Prior Uncertainty 




10° 

a 



d=0.9 
d=0.8 
d=0.7 
d=0.6 
d=0.5 
d=0.4 
d=0.3 
d=0.2 
d=0.1 
d=0.0 



10 



.10 



Figure 3: Prior entropy mean and variance (17) and 18 as a function of a and d. Note that entropy 



is approximiately linear in log a. For large values of a, p(H (iz)\d, a) is highly concentrated 
around the mean. 



where p(7r|7, h) denotes a PYP on 7T with parameters 7, h. The ability to adapt (7(7) to a given 
problem greatly enhances PYM's flexibility. The PYM mixture priors resulting from two different 
choices of 9(7) are both approximately flat on H, but each favors distributions with different 
tail behavior. Fig. [4jB shows samples from this prior under three different choices of 9(7), for h 
uniform on [0, 3]. For the experiments, we use q(j) = cxp(— y^)1{ 7<1 } which yields good results by 
weighting less on extremely heavy-tailed distributions. Combined with the likelihood, the posterior 
p(d, a|x) cx p(x\d, a)p(d, a) quickly concentrates as more data are given, as demonstrated in Fig. [5j 

4.2 The Pitman- Yor Mixture Entropy Estimator 

Now that we have determined a prior on the infinite simplex, we turn to the problem of inference 
given observations x. The Bayes least squares entropy estimator under the mixture prior p(d, a), the 
Pitman- Yor Mixture (PYM) estimator, takes the form 

H PYM = E[H|x] = / E[H\x, d, a] P(*|rf,a)p(d,a) d(d a) (24) 

J K x ) 



where E[iJ|x, d, a] is the expected posterior entropy for a fixed {d,a) (see Section 3.2). The quantity 
p(x\d,a) is the evidence, given by 



(n*7 V + Id)) (]!*=! r(n, - d)) r(i + a, 

p ^ a ^ - r(i - W + *) ■ (25) 

We can obtain posterior credible intervals regions for -ffpYM by estimating the posterior variance 



E[(H — /fpYM) 2 |x]. The estimate takes the same form as (24), except that we replace E[iJ|x, d, a] 
with var[_ff |x, d, a] in the integrand. 
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£^ (standard params) 

E[H\d, a] 

1 



(new params) 

E[H\7,h] 




10 20 

log a 



h 



B 



0.1 



X 

"£ 0.05 



nnnnnilnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn 

DP(a) 




2 3 
Entropy (H) 

Figure 4: Prior over expected entropy under Pitman- Yor process prior. (A) Left: expected entropy 
as a function of the natural parameters (d, a). Right: expected entropy as a function 
of transformed parameters (h, 7). (B) Sampled prior distributions (N — 5e3) over 
entropy implied by three different PYP priors: PY(d, 0) (red), PY(d, a) (grey), and 
PY(0, a) = DP(a) (blue). We show the distributions only on the range from to 5 
nats; sampling from PY becomes prohibitively expensive with increasing expected entropy, 
especially as d — > 1. 



4.3 Computation 

Due to the improperness of the prior p(d, a) and the requirement of integrating over all a > 
(eq. (24)), it is not obvious that the PYM estimate Hpym is computationally tractable. In this 
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Sample Size: 50 




Sample Size: 200 



Sample Size: 1000 




100 200 300 



100 200 300 



100 200 300 



s- 0.2 




Figure 5: Convergence of p(d, a|x) for increasing sample size. Both parameterization (d, a) and (7, h) 
are shown. Data are simulated from a PY(0.25, 40) whose parameters are indicated by the 
red dot. 



section we discuss techniques for efficient and accurate computation of -ffpYM ■ First, we outline 
a compressed data representation we call the "multiplicities" representation, which substantially 
reduces computational cost. Then, we outline a fast method for performing the numerical integration 
over a suitable range of a and d. 

4.3.1 Multiplicities 

Computation of the expected entropy E[iJ|x, d, a] can be carried out more efficiently using a 
representation in terms of multiplicities (also known as the empirical histogram distribution function 



(Paninski 2003)), the number of symbols that have occurred with a given frequency in the sample. 



Letting rrik = \{i ■ rii = k}\ denote the total number of symbols with exactly fc observations in the 
sample gives the compressed statistic m = [mo, m ii • ■ • 7 mm] , where M is the largest number of 
samples for any symbol. Note that the inner product [0, 1, ... , M] T m = N, is the total number of 
samples. The multiplicities representation significantly reduces the time and space complexity of our 
computations for most datasets, as we need only compute sums and products involving the number 
symbols with distinct frequencies (at most M), rather than the total number of symbols K. In 
practice we compute all expressions not explicitly involving ir using the multiplcities representation. 
For instance, in terms of the multiplicities the evidence takes the compressed form, 

p(x\d,a) =p(m 1 ,...,m M \d,a) 

_ r(i + a)nfLT 1 (« + ^) tt ( r(i-d) \ m * Ml 



n 



T(a + N) 11 \^!r(l - d)J rriiV 



11 



Archer, Park, and Pillow 



4.3.2 Heuristic for Integral Computation 



In principle the PYM integral over a is supported on the range [0, oo). In practice, however, the 
posterior is concentrated on a relatively small region of parameter speace. It is generally unncessary 
to consider the full integral over a semi-infinite domain. Instead, we select a subregion of [0, 1] x [0, oo) 
which supports the posterior up to e probability mass. We illustrate the concentration of the posterior 
visually in figure [5j We compute the hessian at the MAP parameter value, (cZmap, qmap)- Using the 
inverse hessian as the covariance of a Gaussian approximation to the posterior, we select the grid 
which spans ±6 std. We use numerial integration (Gauss-Legendre quadrature) on this region to 
compute the integral. When the hessian is rank-deficient (which may occur, for instance, when the 
qmap = or c?map = 0), we use Gauss-Legendre quadature perform the integral in d over [0, 1), but 



employ a Fourier- Chebyshev numerical quadrature routine to integrate a over [0, oo) (Boyd 19871 



4.4 Sampling the full posterior over H 

The closed-form expressions for the conditional moments derived in the previous section allow us 
to compute PYM and its variance by 2-dimensional numerical integraton. PYM's posterior mean 
and variance provide essentially a Gaussian approximation to the posterior, and corresponding 
credible regions. However, in some situations (see Fig. [6]), variance-based credible intervals are a poor 
approximation to the true posterior credible intervals. In such situations we may wish to examine 
the full posterior distribution over H. In what follows we describe methods for exactly sampling 
the posterior and argue that the posterior variance provides a good approximation to the true 



credible interval in many situations. Stick-breaking, as described by ( 14 ) , provides a straightforward 




Monte Carlo Samples 
Gaussian Approximation 



0.1 0.2 0.3 0.4 0.5 
Entropy 




Entropy 



Figure 6: The posterior distributions of entropy for two datasets of 100 samples drawn from distinct 
distributions and the Gaussian approximation to each distribution based upon the posterior 
mean and variance, (left): Simulated example with low entropy. Notice that the true 
posterior is highly asymmetric, and that the Gaussian approximation does not respect 
the positivity of H. (right) Simulated example with higher entropy. The Gaussian 
approximation is a much closer approximation to the true distribution. 



algorithm for sampling distributions ir ~ PY(d, a). With large enough N s , stick-breaking samples 
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approximate n to arbitrary accuracjj^] Even so, in practice sampling 7r is difficult when it is 
heavy-tailed. When sampling tt ~ PY(d, a) for d near 1, where 7r is likely to be heavy-tailed, N s 
may need to be intractably large to assure ^2 i>N tt% < e. In such sitations, truncation may result in 
serverely biased samples of entropy. We address this problem by directly estimating the entropy of 
the tail, PY (d, a + N e d), using (17). As shown in Fig. |3j the prior variance of PY becomes arbitrarily 
small as for large a. For sampling, N s need only be large enough to make the variance of the tail 
entropy small. The resulting sample is the entropy of the (finite) samples plus the expected entropy 
of the tail, H(tt*) + E[H\d, a + Kd}^ Sampling entropy is most useful for very small amounts of data 
drawn from distributions with low expected entropy. In Fig. [5] we illustrate the posterior distributions 
of entropy in two simulated experiments. In general, as the expected entropy and sample size increase, 
the posterior becomes more approximately Gaussian. 



5. Theoretical properties of PYM 

Having defined PYM and discussed its practical computation, we now consider first establish 



conditions under which (24) is defined (that is, finite), and also prove some basic facts about its 
asymptotic properties. While -ffpYM is a Bayesian estimator, we wish to build connection to the 
literature by showing frequentist properties. Note that the prior expectation E[H] does not exist for 
the improper prior defined above, since p(h = E[£T]) cx 1 on [0, oo]. It is therefore reasonable to ask 
what conditions on the data are sufficient to obtain finite posterior expectation -£?pym = -B[-ff|x]. 
We give an answer to this question in the following short proposition (proofs of all statements may 
be found in the appendix), 

Theorem 2 Given a fixed dataset x of N samples, Hpym < oo for any prior distribution p(d, a) if 
N -K > 2. 

In other words, we require 2 coincidences in the data for Hpym to be finite. When no coincidences 
have occurred in x, we have no evidence regarding the support of the 7r, and our resulting entropy 
estimate is unbounded. In fact, in the absence of coincidences, no entropy estimator can give a 
reasonable estimate without prior knowledge or assumptions about A. Concerns about inadequate 
numbers of coincidences are peculiar to the undersampled regime; as we collect more data, we 
will almost surely observe each letter infinitely often. We now turn to asymptotic considerations, 
establishing consistency of Hpym in the limit of large N for a broad class of distributions. It is 
known that the plugin is consistent for any distribution (finite or countably infinite), although the 



rate of convergence can be arbitrarily slow ( Antos and Kontoyiannis 2001 1. Therefore, we establish 



consistency by showing asymptotic convergence to the plugin estimator. For clarity, we explicitly 
denote a quantity's dependence upon sample size N by a introducing a subscript. Thus, x and K 
become xjy and Kn, respectively. As a first step, we show that E[if|xjv, d, a] converges to the plugin 
estimator. 

Theorem 3 Assuming xjv drawn from a fixed, finite or countably infinite discrete distribution n 
such that A 0, 

\E[H\ XN ,d,a]-E[H plugiD \x N }\ A 
The assumption K^/N — > is more general than it may seem. For any discrete distribution it 



holds that K N -> E[K N ] a.s., and E[K N ]/N -> a.s. (Gnedin et al. 2007), and so K N /N -> 



in probability for an arbitrary distribution. As a result, (20) shares its asymptotic behavior with 



H piugin, in particular consistency. As ( |20[ ) is consistent for each value of a and d, it is intuitively 

4. Bounds on N s , the number of samples (i.e., sticks), necessary to reach e on average are provided in jlshwaran and| 
|James||2001[ |. 

5. Due to the generality of the expecation formula \15\ , this method may be applied to sample the distributions of 
other additive functionals of PY. 
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plausible that PYM, as a mixture of such values, should be consistent as well. However, while (20) 
alone is well-behaved, it is not clear that PYM should be. Since E[ff |x, d, a] — > oo as a — > oo, care 
must be taken when integrating over p(d, a\x). Our main consistency result is, 

Theorem 4 For any proper prior or bounded improper prior p(d, a), if data xjy are drawn from a 
fixed, countably infinite discrete distribution tt such that for some constant C > 0, K N = o(7V 1 ~ 1 / c ) 
in probability, then 

\E[H\x N }-E[H plugin \x N }\^0 
Intuitivel y, the asymptot ic behavior of Kn /N is tightly related to the tail behavior of the distribu- 



tion ( |Gnedin et al. 2007). In particular, K n ~ cN a with < a < 1 if and only if TTi ~ c'i* where c 
and d are constants (Gnedin et al. 2007). The class of distributions such that = o{N 1 ^ 1 ^ c ) 
a.s. includes the class of power-law or thinner tailed distributions, i.e., 7Tj = 0(i a ) for some a > 1. 
We conclude this section with some remarks on the role of the prior in Theorem [4] as well as the 
significance of asymptotic results in general. While consistency is an important property for any 
estimator, we emphasize that PYM is designed to address the undersampled regime. Indeed , since 
-Hpiugin is consistent and has an optimal rate of convergence for a large class of distributions ( Antos 
and Kontoyiannis 2001 Vu et al. 2007 Zhang 2012), asymptotic properties provide little reason 
to use Hpym- Nevertheless, notice that Theorem [4] makes very weak assumptions about p(d,a). In 
particular, the result is not dependant upon the form of the PYM prior introduced in the previous 
section: it holds for any probability distribution p(d, a), or even a bounded improper prior. Thus, we 
can view Theorem [4] as a statement about a class of PYM estimators. Almost any prior we choose 
on (d, a) results in a consistent estimator of entropy. 



6. Results 



We compare -ffpyM to other proposed entropy estimators using several example datasets. Each 
plot in Figs [7j [HJ |9j and |10| shows convergence as well as small sample performance. We compare 
our estimators, DPM (d — only) and PYM (ffpyM), with other enumerable-support estimators: 



coverage-adjusted estimator (CAE) ( |Chao and Shen| |2003| |Vu et al 
section 



2007], asympto tic NSB (ANSB, 
2.4[) (|Nemenman[ |20lij), James-Stein (JS) (|Hausser and Strimmer 2009), Grassberger's 



asymptotic bias correction (GR08) (Grassberger 2008| ), and Good- Turing estimator ( |Zhang 2012) 



Note that like ANSB, DPM is an asymptotic (Poisson-Dirichlet) limit of NSB, and hence behaves 
close to NSB assuming a large number of symbols. We also compare with plugin ^ and a standard 
bias correction methods assuming finite support: Miller-Maddow bias correction (MiMa) (Miller 



1955). To make comparisons more straightforward, we do not apply jackknife-based bias correction 
to any of the estimators. PYM performs well as expected when the data are truly generated by a 
Pitman- Yor process (Fig. [7|. Credible intervals for DPM tend to be smaller than PYM, although 
both shrink quickly (indicating high confidence). When the tail of the distribution is exponentially 
decaying, [d = case; Fig. [7] top), DPM shows slightly improved performance. When the tail has 
a strong power-law decay, (Fig. [7] bottom), PYM performs better than DPM. Most of the other 
estimators are consistently biased down, with the exception of JS and ANSB. Although Pitman- Yor 
process PY(d, a) has a power-law tail controlled by d, the high probability portion is modulated 
by a, and does not strictly folllow a power-law distribution as a whole. In Fig. [HJ we evaluate the 
performance for pi cx i~ 2 and pi cx z -15 . PYM and DPM has slight negative bias, but the credible 
interval covers the true entropy for all sample sizes. For small sample sizes, most estimators are 
negatively biased, again except for JS and ANSB (which does not show up in the plot since it is 
severely biased upwards). Notably CAE performs very well in moderate sample sizes. In Fig. |9j we 
compute the entropy per word of in the novel Moby Dick by Herman Melville, and entropy per time 
bin of a population of retinal ganglion cells from monkey retina (Pillow et al. 2005 ). These real- world 
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PY (0.1, 1000) 
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Figure 7: Comparison of estimators on stick-breaking distributions. Poisson-Dirichlet distribution 
with (d = 0,a= 1000) (top), (d = 0.1, a = 1000) (middle), (d = 0.4, a = 100) (bottom). 
We compare our estimators (DP, PYM) with other enumerable support estimators (CAE, 
ANSB, JS, Zhang, GR08), and finite support estimators (plugin, MiMa). Solid lines are 
averaged over 10 realizations. Shaded area represent two standard deviations credible 
intervals averaged over 10 realizations. 



15 



Archer, Park, and Pillow 



Power-law with [exponent 2] 



_ 2 

^2 
ra 

_c 

>. 1.5 

Q. 
O 

m 1 



10 60 300 10000 



Power-law with [exponent 1 .5] 

2 



CO 

I 1.5 

>, 

Q. 
O 

HI 



0.5 

10 60 300 2500 15000 100000 
# of samples 

Figure 8: Comparison of estimators on power-law distributions. 
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Figure 9: Comparison of estimators on real data sets. 
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datasets have heavy, approximately power-law tails as pointed out earlier in Fig. [2] For Moby Dick, 
PYM slightly overestimates, while DPM slightly underestimates, yet either method is closer to the 
entropy estimated by the full data available than other estimators. DPM is overly confident (its 
credible interval is too narrow), while PYM becomes overly confident with more data. The neural 
data were preprocessed to be a binarized response (10 ms time bins) of 8 simultaneously recorded 
off-response retinal ganglion cells. PYM, DPM, and CAE all perform well on this dataset, with both 
PYM and DPM bracketing the asymptotic value with their credible intervals. Finally, we applied the 
denumerable support estimators to finite support distributions (Fig. 10). The power-law p n oc rtT 1 
has the heaviest tail among the simulations we consider, but notice that it does not define a proper 
distribution (the probability mass does not integrate), and so we use a truncated 1/n distribution 
with the first 1000 symbols (Fig. 10 top). Initially PYM shows the least bias, but DPM provides 
a better estimate for increasing sample size. Notice, however, that for both estimates the credible 
intervals consistently cover the true entropy. Interestingly, the finite support estimators perform 
poorly compared to DPM, CAE and PYM. For the uniform distribution over 1000 symbols, both 



DPM and PYM have slight upward bias, while CAE shows almost perfect performance (Fig. 10 
middle). For Poisson distribution, a theoretically enumerable support distribution on the natural 
number, the tail decays so quickly that the effective support (due to machine precision) is very small 
(26 in this case). All the estimators, with the exception of JS and ANSB, work quite well. Note that 
JS performs poorly for both uniform and Poisson distribution (it shows severe upward biased). The 
novel Moby Dick provides the most challenging data: no estimator seems to have converged, even 
with the full data. Surprisingly, the Good- Turing estimator ( Zhang||2012 ) tends to perform similarly 
to the Grassberger and Miller-Maddow bias-correction methods. Among such the bias-correction 
methods, Grassberger's method tended to show the best performance, outperforming Zhang's method. 



7. Conclusion 

In this paper we introduced PYM, a novel entropy estimator for distributions with unknown support. 
We derived analytic forms for the conditional mean and variance of entropy under a DP and PY 
prior for fixed parameters. Inspired by the work of ( Nemenman et alTj 2002), we defined a novel PY 



mixture prior, PYM, which implies an approximately flat prior on entropy. PYM addresses two major 
issues with NSB: its dependence on knowledge of A and its inability (inherited from the Dirichlet 
distribution) to account for the heavy-tailed distributions which abound in biological and other 
natural data. We have shown that PYM performs well in comparison to other entropy estimators, 
and indicated its practicality in example applications to data. A MATLAB implementation of the 
PYM estimator is available at https://github.com/pillowlab/PYMentropy. 



Appendix A. Derivations of Dirichlet and PY moments 

In this Appendix we present as propositions a number of technical moment derivations used in the 
text. 

6. We emphasize that we use the term "power-law" in a heuristic, descriptive sense only. We did not fit explicit 
power-law models to the datasets in questions, and neither do we rely upon the properties of power-law distributions 
in our analyses. 
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Figure 10: Comparison of estimators on finite support distributions. Black solid line indicates the 
true entropy. Poisson distribution (A = e) has a countably infinite tail, but a very thin 
one — all probability mass was concentrated in 26 symbols within machine precision. 
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A.l Mean entropy of finite Dirichlet 



Proposition 5 (Replica trick for entropy ( Wolpert and Wolf 1995| )) Font ~ Dir(ai, o 2l . . . , c\a)> 



such that X)i=i a i = A, and letting a = (ax, 02, ■ ■ ■ , ola), we have 



A 



-ClE[H(ir)\a] = ^{A + 1) - ^ ^(a ( + 1) 



(26) 



Proof First, let c be the normalizer of Dirichlet, c = ^p|^" J - and let L denote the Laplace transform 
(on 7T to s). Now, 



cE[H\a] 



-E/ iog2^)^te^-i)II 7r ? _lrf7r 



E 

E 
E 



d(Oi) 

d(a.i) 



Ct-j — 1 \ 



r(^ + i)n^r( Qj 



(1) 
(1) 



r(a t + 1) 
r(£ fe K) + 1) 



n%) 



./i 



n 3 r( Qi ) 
r(A) 



A. 2 Variance entropy of finite Dirichlet 

We derive E[if 2 (7r)|a]. In practice we compute var[if(7r)|a] = E[£f 2 (7r)|d<] - E[iJ(-7r)|a] 2 . 

Proposition 6 For tv ~ Dir(ai, 02, ... , suc/i that a i — A, and letting a — (ai, ct^, . . . , ctjC), 
we have 

E[H (n)\a] = E (j4 + + E (A + ^ ( 27 ) 

= (^oK + 1) - ^o(^ + 2)) + 1) 

-V>o(^ + 2))-Vi(^ + 2) 
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Ji = (M^ + 2) - M A + 2)f + Vi(c*i + 2) 
-MA + 2) 

Proof We can evaluate the second moment in a manner similar to the mean entropy above. First, 
we split the second moment into square and cross terms. To evaluate the integral over the cross 
terms, we apply the "replica trick" twice. Letting c be the normalizer of Dirichlct, c = ^p|^" 3 - we 
have, 

cE[H 2 \a}= J (-J2^og 2 ^ £(£ i ^-l)n 7r ?" ld7r 

= w K 2 io g2 ^<K£^-i)n<'~ ld7r 

i J 3 

+ E l0g2 **) ^ k l0g 2 7r * ; ) ^tei 71 "* _ !) II ^^d-K 

i^k J j 

= e/ cw^to vn^-'dn 

i j^i 

+ E / W" lQ S2 Ti) W* log2 TTfc) 5(E 4 7Ti - 1) I] 
i^fc j<Z{i,k} 

Assuming i ^ k, these will be the cross terms. 

/ (*» log 2 Ki)(-K k log 2 7r fc )(5(y: i 7r i - l)]j7T" 3_1 c?7r 
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[(^oK + l)-Vo(A + 2)) 
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Summing over all terms and adding the cross and square terms, we recover the desired expression for 
E[ff 2 (7r)|a]. ■ 



A. 3 Prior entropy mean and variance under PY 

We derive the prior entropy mean and variance of a PY distribution with fixed parameters a and d, 
EttIHItt) d,a] and var 7T [if (7r)|d, a]. We first prove our Proposition [l] (mentioned in ( |Pitman and 



p(TTi\a)dTTi which 



Yor 1997)). This proposition establishes the identity E J2i=i f( w i) 
will allow us to compute expectations over PY using only the distribution of the first size biased 
sample, tt\. 

Proof [Proof of Proposition [l] First we validate ( 15 ). Writing out the general form of the size-biased 
sample, 

oo 

p(7T a = x\n) = ^ M{x - 71*), 



we see that 



E #1 



f(x) 



p(wi — x)dx 



= E 7 



f(x) 



p{tti = x\tz) 



dx 



22 



Bayesian Entropy Estimation for Countable Discrete Distributions 



E 7 



} J TTid{x - 7Ti) 



— E, 



i=i 

l oo 



dx 



ft*?-* 



X — TTi)d,X 



1=1 

oo „X 



E 

.2 = 1 
OO 

£/(- 



/(*) 



7Tj<J(a; — TTi)dx 



where the interchange of sums and integrals is justified by Fubini's theorem. A similar method 



the sum inside the expectation on the left hand side of (16 1, 



validates (|T6|). We will need the second size-biased sample in addition to the first. We begin with 

(28) 
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(31) 



where the joint distribution of size biased samples is given by, 

p(7Ti = 7r j; ,7r 2 = ttj) = p(%i = n)p{^2 = = ^i) 



1 — 7T< 



As this identity is defined for any additive functional / of 7r; we can employ it to compute the first 
two moments of entropy. For PYP (and DP when d = 0), the first size-biased sample is distributed 
according to: 

7fi ~ Beta(l - d, a + d) (32) 
Proposition [l] gives the mean entropy directly. Taking f(x) = — xlog(x) we have, 

E[J?(7r)|d,a] = -E a [log(7n)] =^ (a + l)-^o(l-d), 

The same method may be used to obtain the prior variance, although the computation is more 
involved. For the variance, we will need the second size-biased sample in addition to the first. The 
second size-biased sample is given by, 



7r 2 = (1 — 7Ti)i>2, v 2 ~ Beta(l — d, a + 2d) 



(33) 



23 



Archer, Park, and Pillow 



We will compute the second moment explicitly, splitting H(tz) 2 into square and cross terms, 

2 



E[(H(ir)f \d,a] = E 
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J2 io g(^)) 



d, a 



+ E 
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(34) 
(35) 



The first term follows directly from (j 1 5|) 
E 



d, a 
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(36) 



The second term of ([35]), requires the first two size biased samples, and follows from (|16j) with 
g(x,y) — log(a;) log(y). For the PYP prior, it is easier to integrate on V\ and V 2 , rather than 
the size biased samples. The second term is then (note that we let 7 = B~ l {l -d,a + 2d) and 
t = B- 1 (l-d,a + d)), 

E[E pog(7fi)log(7rs)(l-7ri)|ir]|a] 
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Finally combining the terms, the variance of the entropy under PYP prior is 
var[if (7r)|d, a] = 

j [(^o(2 - d) - ^o(2 + a)) 2 + Vi(2 - d) - Vi(2 + a)] 
a + d 
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We note that the expectations over the finite Dirichlet may also be derived using this formula by 
letting the 7r be the first size-biased sample of a finite Dirichlet on Aji . 

A.4 Posterior Moments of PYP 

First, we discuss the form of the PYP posterior, and introduce independence properties that will be 
important in our derivation of the mean. We recall that the PYP posterior, 7r post , of (19) has three 
stochastically independent components: Bernoulli p*, PY 7r, and Dirichlet p. 

Component expectations: From the above derivations for expectations under the PYP and 
Dirichlet distributions as well as the Beta integral identities (see e.g., (Archer et al. 2012 1), we find 
expressions for E p [H(p)\d, a], [H(Tv)\d, a], and E Pt [H(p*)]. 



E[H(n)\d,a] = ^ {a + 1) - ip (l - d) 
E p .[H(p.)]=M<* + N + V " + A "' 



a + N 



Ma + Kd+l) 



N - Kd 
a + N 



iP (N-Kd+l) 



K 



E p [H(p)\d,a] = MN- Kd + 1)-J2 



N -Kd 



ipo{rii - d+ 1) 



where by a slight abuse of notation we define the entropy of as H{p Sf ) = — (1 — p*)log(l — 
p*) — p* logp*. We use these expectations below in our computation of the final posterior integral. 
Derivation of posterior mean: We now derive the analytic form of the posterior mean, |20|). 



E[H(w post )\d,a\ = E 



A" 



= E 



= E 



-^Pi l°g Pi ~P* l0gP*7T. 

OO 

-(1 -p*) log(l -p*) -p^^TT, logTTj -p„ bgp* 

i 

-(i-».)Et^->» 6 



d, a 



E 



E 



= E 



E 



1-P. 

OO 

~P* X! n i lo § n i +H{p*) d,a 

i=l 
oo 

-P* X! n * lo S n i + H (P*) 

1=1 

(1 - p,)H{p) + p,H(n) + H{p,) 



d, a 



d, a 



= E p , [(1 -p*)E p [H(p)\d,a] + p,E^ [H(ir)\d,a] + H(p*)} 

using the formulae for E p [H(p)\d, a], E„. [H(iv)\d, a], and E Pt [H(p*)] and rearranging terms, we 
obtain (pOl), 



E[H(TT post )\d,a] 



A 



a + N 



E p [H(p)} 
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Kd 



N 



E w [if(7r)]+E p . [H(p*)] 



A 

a + N 

a + Kd 



^(A + lJ-^-^ofo + l) 



a + N 
ipo{a + N + 1) 



A 

[ip (a + Kd + l)-ip { 1 - d )} + 
a + Kd 



A 



M<* + Kd+l)- —-MA + 1) 
a + N a + N 



ipo(a + N + l) 



ij; {a + N + 1) 



Kd 



a + N 

A 

a + N 

a + Kd 



K 



Yl ^oioii + 1) 



a + N 
1 



i=l 

Vb(l-d)- 

K 



a + N 



^(ni - d)ip (rii -d+1) 



i=l 



Derivation of posterior variance: We continue the notation from the subsection above. In order 
to exploit the independence properties of 7r p0 st we first apply the law of total variance to obtain ( 39 1 , 



var[F(7r po st)|d, a] = var E 7r)P [F(7r pos t)] 



d, a 



var[iJ(7r post )] 



(39) 



We now seek expressions for each term in (39 1 in terms of the expectations already derived. Step 1: 



For the right-hand term of ( 39 1 , we use the independence properties of 7r pos t to express the variance 
in terms of PYP, Dirichlet, and Beta variances, 



= E 



var[#(7r post )|p*] 



7T,p 



d, a 



(40) 



p. 



(1 - P *) 2 vai-[H(p)} +^var[ff(7r)] 



d, a 



(N-Kd)(N-Kd+l) 
(a + N)(a + N + l) T [g(p)] 
(a + Kd)(a + Kd+l) 
(a + N)(a + N + 1) iv 1 y > l 



(41) 



Step 2: In the left-hand term of ( 39 ) the variance is with respect to the Beta distribution, while the 



inner expecation is precisely the posterior mean we derived above. Expanding, we obtain, 



var 



E w ,p[i?(7T post )|p* 



d. a 



= var (1 -p,)Ep [H(p)} +p*E 7r [H(ir)\p*\ + h(p*) 



d, a 



(42) 
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To evaluate this integral, we introduce some new notation, 



so that 



and we note that 



B 



= E P MP)} 
= E n [H{n)} 

= (1 - p„)Ep [H(p)} + p t E„ [H(ir)] + hfa) 
= (1 -p*)A+p*B + /i(p*) 



n 2 { P *) = 2p m h(p m )\B - A] + 2Ah(p m ) + h 2 {p,) 
+ pl[B 2 - 2AB] + 2p,AB + (1 - p*) 2 A 2 



var 

p. 



E w ,p[-H"(7r p0 8t)b* 



d, a 



E Pt [n 2 (p,)}~E Pt [n( Plt )f 



(43) 



(44) 



The components composing E p> [f2(p*)], as well as each term of (43) can be found in (Archer et al 



2012). Although less elegant than the posterior mean, the expressions derived above permit us to 



compute ( 39 ) numerically from its component expecatations, without sampling. 



Appendix B. Proof of Proposition [2] 

In this Appendix we give a proof for Proposition [2] 
Proof PYM is given by 



Hpym 



p( x ) Jo 



if( < j,a)J , ( x Mj a )p(d, a) da dd 



where we have written H(d.a) : = E[.ff|d, a, x]. Note that p(pc\d,a) is the evidence, given by (25). We 
will assume p{d, a) — 1 for all a and d to show conditions under which H(d,a) is integrable for any 
prior. Using the identity r ^"^ = n™=i( :r + * — 1) an d the log convexity of the Gamma function we 
have, 

j i . . tt T(m — d) T(a + K) 
p(x\d,a) < — V c 



< 



r(rij - d) 1 



r(i - d) a N - K 

Since d € [0, 1), we have from the properties of the digamma function, 



</>o(l-G0=^o(2-d) 
and thus the upper bound, 



1-d 



> ^o(l) 



1-d 



-7- 



#(tf,a) <^o(a + ^+l) + 
if 



a + Kd f 1 

7 + 



1 



a + A 



a + TV V ' ' 1-d 
y^(^i - d)ip (rii - d + 1) 



1-d' 



(45) 
(46) 
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Although second term is unbounded in d notice that ^(l-dj = Tl7=i (* — d)\ thus, so long as 
N — K > 1, ff( a) d)f>(x|d, a) is integrable in d. For the integral over alpha, it suffices to choose 
a N and consider the tail, H( d ^p(x\d, a)p(d, a) da. From (45 1 and the asymptotic expansion 
ip(x) = log (a;) — ^ — + O(-^z) as x — > oo we see that in the limit of a ^> iV, 

-ff(d.a) <l0g(o + iV + 2) + -, 

a 

where c is a constant depending on if, N, and d. Thus, we have 
#(d,cOK x l d > a)p(d, a) da 

n^rK-d) 



< 



and so 77 



(d,a; 



r(i - d) 

is integrable in a so long as N 



log (a - 
K > 2. 



TV- 



~ da 



Appendix C. Proofs of Consistency Results 
Proof [proof of Theorem [3] We have, 



lim E[7J \a, d, xjv] 

A— >oo 



= lim 

A->oo 



Vo(« + TV+ 1) r^rM^ - d)- 



1 



a + TV 



a + TV 

y^(rij - d)ip {rii - d + 1) 



A' 



lim 



iV- 



Aa 



</>o(a + TV + 1) - 2 ^tfofa - d + 1) 



A, 



lim V -i [^(fN - d + 1) - ^o(a + A + 1)] 



A-s-oo ^— ' AT 
i=i 



although we have made no assumptions about the tail behavior of tt, so long as > 0, E[nfc] = 
E[^"[ = P{ x i — k} — limpf^oo Nwk oo, and we may apply the asymptotic 

expansion ip(x) = log(a;) — ~ — + O(^) as x — > oo to find, 



A 



lim E[i7|a, d, xat] = H ] 



plugin 



We now turn to the proof of consistency for PYM. Although consistency is an intuitively plausible 
property for PYM, due to the form of the estimator our proof involves a rather detailed technical 
argument. Because of this, we break the proof of Theorem [4] into two parts. First, we prove a 
supporting Lemma. 

Lemma 7 If the data xjv have at least two coincidences, and are sampled from a distribution such 
that, for some constant C > 0, = o{N 1 ^ 1 / c ) in probability, the following sequence of integrals 
converge. 

E[i7|a,d,xjvJ ? — ^ dadd — > E[H p i ugm \x N \ 



io Jo 
where c > is an arbitrary constant. 



p(*n) 
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Proof 

Notice first that E[H \a, d, x^v] is monotonically increasing in a, and so 



1 ' E[g|a,d,x jy ] P(X f f|a : d) drul,/ 



a=0 J d=0 



pKif+c pi 

/ / E[H\K N + c,d,x N 

Ja=0 Jd=0 



< 



p(x. N \a,d) 



da dd 



We have that, 



E[H \K N + c, d, xjv] = M^N + c + N+l) 
(l + d)K N +c 



(47) 



K N + N + c 



Vo(l-rf) 



K 



N 



1 \ 



As a consequence of Proposition^ J"j_ (l + d)ip(l — rf) p ^^ dd < oo, and so the second term is 
bounded and controlled by K^/N. We let 

A { d,N):=- ^ d l K » + C M l-d) 
K N + N + c 

and, since limjv^oo Jd=o ^(^' ^0 ^t^^v = 0, we focus on the remaining terms of (47). We also 
let Bn) := J2i=i (^^aT 1 1°S (lv)): an ^ note tnat um JV->-oo B = # p i ug in- We hnd that, 

E[H\K N + c, d,x N ] 

< log(iV + K N + c + 1) + A(d, N) 



E 



n, — 1 
Kn + N + c 



log(ni) 



log(iV + i^jv + c + 1) + A(d, N)- 



N 



K N + N + c 
= log [ I 



i=i 
K N + c+l 



N -K 



N 



+ log(iV) 
log |1 + 



N 

2K N + c 
N + K N + c 
K N + c+l 



A(d, N) 



N 



N 



K N + N + c 
A(d, N) 



N 



-B 



log(A0 



1 



2K N + clog{N) 



N 



1 + (K N + c)/N ATi-i/C ATi/c K N + N + c 



B 



plugin 
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As a result, 



a=0 Jd=0 



< 



"plugir 
-^plugin 



K » +c n p( XA r|a,d) 



Q =0 Jd=0 P( x w) 



ddda + o(l) 



For the lower bound, we let H^ a d N ^ := E[H\a, d, x-n}~L[o.k n +c] ( a )- Notice that exp(—H^ ad N ^) < 1, 
so by dominated convergence lmiAr^co E[exp(— H( a .d,N))] = ex P( — -ffpiugin) by Proposition [ij And so 
by Jensen's inequality, 

exp(- lim E[H ia±N) })< lim E[exp(-H( adN) )]=exp(-H p i ngin ) 



A? 



lim E[H( a ^N)] > -ffplugin, 



and the lemma follows. 



We now turn to the proof of our primary consistency result. 

Proof [proof of Theorem [4] 

/ / E[H\a, d, XN ] pi * Nla ; d)p } a > d) dadd 
J J p(xat) 

= P f\[H\a,d,x N ] p{XNla ; d)P ^ d) dadd 
Jo Jo P( X N) 

+ f\[H\ a ,d, XN ] pMa ;^ M dadd 

Ja Jo P{XN) 

If we let ao — K n + 1, by Lemma [7j 

f ° f E[H\a, d, x N } PiXNla ; d)p{a > d) dadd -+ E[if plugin | Xjv ]. 

JO JO P{ X N) 

Therefore, it remains to show that 

I 00 I' E[H\a, d, XN f Ma ! d)p(a > d) dadd -> 
J ao Jo P[xn) 

For finite support distributions where Kpj —> K < 00, this is trivial. Hence, we only consider infinite 
support distributions where Km — > 00. In this case, there exists Nq such that for all N > No, 
p([0, K N + 1], [0, 1)) ^ 0. Since p(a, d) has a decaying tail as a -> 00, 3N VN > N Q , p(K N + 1, d) < I, 
thus, it is sufficient demonstrate convergence under an improper prior p(a, d) — 1. Using, 

E[H\a, d,x N ] < *Po(N + a + 1) < N + a 

we bound 

r [\[H\a,d,x N ] Pix ^ a : d) dadd 

Jao Jo P(*n) 

„ Ja fo( N + l)p{*N\a,d)dadd 



xjv|a, 
p(*n) 



IZSo P{*N\a,d)dadd 
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We focus upon the first term on the RHS since its boundedness implies that of the smaller second 
term. Recall, that p(x) = f°Z. Q J^ =0 p(x\a, d) ddda. We seek an upper bound for the numerator and 
a lower bound for p(xjv). Upper Bound: First we integrate over d to find the upper bound of the 
numerator. (For the following display only we let 7(0?) = ^JlS ^( n i ~ d)j ■ 



00 1* 1 

/ (N + a- l)p(xjv|a,d)dadd 

ao JO 

(UtT\ a + Id)) l(d)T(l + a)(N + a - 1) 
, ao J d =o T(l~d)^T(a + N) ddda 

-y d=o r(l-d)^ J ao T(a + N) 

Fortunately, the first integral on d will cancel with a term from the lower bound of p(xpf). UsingM 

(N+a-l)r{a+K N ) _ Bcta,(a+K N , N- K- 1) 
Y(a+N) ~ V{N-K-l) ' 

(N + a - l)r(a + K) , 

f(^nv) da 



■0 

oo /•! 



r(Ar 


-K 
1 


-1) 


T(N 


-K 
1 


-1) 


T(N 


-if 
1 


-1) 


T(N 


-if 
1 


-1) 



t a+K - x {l-t) N - K - 2 dtda 

1 ^ao + A'-l 



=0 

1 r(o;o + K )Y{N - K - 2) 

T(N-K - 1) r(iV + a - 2) 

r(« + if) 

r(7V + a -2)(iV-if -2) 



Lower Bound: Again, we first integrate d, 

l 

p(x|a, d) dd da 

a=0 Jd=0 

K-l 



1 (Ifer (« + ")) (nf=irK-d))r(i + a) 

d<i da 



o Jd=0 



T(l - d) K T(a + N) 



1 (n£ir(ni-d) 



d=0 



r(i-d)* 7 Q =o r(a + vv) 



7. Note that in the argument for the inequalities we use K rather than Kn for concision. 
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So since F(1+a) - Beta(l+a,jV-l) , 

so, since r(Q+JV) - r(jv-i) > tnen 

r°° a K - l T(l + a) 
T(N-l) - { \ } da 

{ 'J a=0 T(a + N) 

> a K - 1 Bcta(l + a,N -l)da 

Ja=0 
/•oo /*1 

= / t a (l-t) N ~ 2 dtda 
Ja=0 Jt=o 

p 1 /'OO 

= / (l-t) N - 2 / a K -H a dadt 

Jt=0 Ja=0 

=w £ (i ~* )iv " 2iog G) 

>r(A-) / (i-t) N - K - 2 t K dt 

Jt=o 

= r(K)Bet&(N -K-l,K + l) 

where we've used the fact that log(|) _1 > Finally, we obtian the bound, 

f°° a^- 1 r(l + a) T(K)T(N - K - 1)T(K + 1) 

J a=0 T(a + N) a ~ T(N-1)T(N) " 

Now, we apply the upper and lower bounds to bound PYM. We have, 
IZ fo( N + a ~ iXxJvIa, d)dadd 

r{a + K N ) r(jv-i)r(jv) 



< 



(N — K N - 2)T(N + a -2) T(K N )T(N - K N - 1)T(K N + 1) 

1 T(a + K N ) r(JV-l) 

(N — K N — 2) T(K N ) T(N + a - 2) 

UN) 

X T(N -K N - 1)T(K N + 1) 
TV fK N \ a ° N n-i/2 



(N - K N - 2) V N J (N-K N - 1) N - K »-V 2 (K N + l)*~+i/2 
N 2 fK N \ aa f N ^ Ar - 3/2 



(K N + 1)V2(JV -K N -2)\NJ \N - K N - 1 

A' A \ 1 ' ' 



Where we have applied the asymptotic expansion for the Beta function, 

Beta(x, y) ~ V27T- 



(x + y) 



x+y- 
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a consequence of Stirling's formula. Finally, we take ao := K jv + (C + l)/2 so that the limit becomes, 

N {K N ^ {C+1)/2 



~* K 1/2 V N 

N v 
R C/2 

_ N 



which tends to with increasing N since, by assumption, K N — o{N 1 ^ 1 / c ). 
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